Abstract. We study and provide efficient algorithms for multi-objective model checking
problems for Markov Decision Processes (MDPs). Given an MDP, $M$, and given multiple
linear-time ($\omega$-regular or LTL) properties $\varphi_i$, and probabilities $r_i \in [0,1]$, $i = 1, \ldots, k$, we
ask whether there exists a strategy $\sigma$ for the controller such that, for all $i$, the probability
that a trajectory of $M$ controlled by $\sigma$ satisfies $\varphi_i$ is at least $r_i$. We provide an algorithm
that decides whether there exists such a strategy and if so produces it, and which runs in
time polynomial in the size of the MDP. Such a strategy may require the use of both
randomization and memory. We also consider more general multi-objective $\omega$-regular queries,
which we motivate with an application to assume-guarantee compositional reasoning for
probabilistic systems.

Note that there can be trade-offs between different properties: satisfying property $\varphi_1$
with high probability may necessitate satisfying $\varphi_2$ with low probability. Viewing this as
a multi-objective optimization problem, we want information about the "trade-off curve"
or Pareto curve for maximizing the probabilities of different properties. We show that one
can compute an approximate Pareto curve with respect to a set of $\omega$-regular properties in
time polynomial in the size of the MDP.

Our quantitative upper bounds use LP methods. We also study qualitative multi- 
objective model checking problems, and we show that these can be analysed by purely 
graph-theoretic methods, even though the strategies may still require both randomization 
and memory. 
Figure 1: An MDP with two objectives, $\Diamond P_1$ and $\Diamond P_2$, and the associated Pareto curve.

1. Introduction 

Markov Decision Processes (MDPs) are standard models for stochastic optimization
and for modelling systems with probabilistic and nondeterministic or controlled behavior
(see [Put94, Var85, CY95, CY98]). In an MDP, at each state, the controller can choose
from among a number of actions, or choose a probability distribution over actions. Each
action at a state determines a probability distribution on the next state. Fixing an initial
state and fixing the controller's strategy determines a probability space of infinite runs
(trajectories) of the MDP. For MDPs with a single objective, the controller's goal is to
optimize the value of an objective function, or payoff, which is a function of the entire
trajectory. Many different objectives have been studied for MDPs, with a wide variety of
applications. In particular, in verification research linear-time model checking of MDPs has
been studied, where the objective is to maximize the probability that the trajectory satisfies
a given $\omega$-regular or LTL property $\varphi$ ([CY98, CY95, Var85]).

In many settings we may not just care about a single property. Rather, we may have 
a number of different properties and we may want to know whether we can simultaneously 
satisfy all of them with given probabilities. For example, in a system with a server and two 
clients, we may want to maximize the probability for both clients 1 and 2 of the temporal
property: "every request issued by client $i$ eventually receives a response from the server",
$i = 1, 2$. Clearly, there may be a trade-off. To increase this probability for client 1 we
may have to decrease it for client 2, and vice versa. We thus want to know what are
the simultaneously achievable pairs $(p_1, p_2)$ of probabilities for the two properties. More
specifically, we will be interested in the "trade-off curve" or Pareto curve. The Pareto curve
is the set of all achievable vectors $p = (p_1, p_2) \in [0,1]^2$ such that there does not exist another
achievable vector $p'$ that dominates $p$, meaning that $p \le p'$ (coordinate-wise inequality) and
$p \neq p'$.

Concretely, consider the very simple MDP depicted in Figure 1. Starting at state $s$,
we can take one of three possible actions $\{a_1, a_2, a_3\}$. Suppose we are interested in LTL
properties $\Diamond P_1$ and $\Diamond P_2$. Thus we want to maximize the probability of reaching the two
distinct vertices labeled by $P_1$ and $P_2$, respectively. To maximize the probability of $\Diamond P_1$
we should take action $a_1$, thus reaching $P_1$ with probability 0.6 and $P_2$ with probability 0.
To maximize the probability of $\Diamond P_2$ we should take $a_2$, reaching $P_2$ with probability 0.8
and $P_1$ with probability 0. To maximize the sum total probability of reaching $P_1$ or $P_2$, we
should take $a_3$, reaching both with probability 0.5. Now observe that we can also "mix"
these pure strategies using randomization to obtain any convex combination of these three
value vectors. In the graph on the right in Figure 1, the dotted line plots the Pareto curve
for these two properties.
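
The mixing argument can be checked with exact arithmetic. The following is a minimal sketch, not taken from the paper: it hard-codes the three value vectors of the example, and the helper names `mix`, `dominates`, and `pareto_optimal` are illustrative assumptions.

```python
from fractions import Fraction as Fr

# Value vectors (Pr of reaching P1, Pr of reaching P2) of the three pure
# strategies in the example: always a1, always a2, always a3.
PURE = [(Fr(3, 5), Fr(0)), (Fr(0), Fr(4, 5)), (Fr(1, 2), Fr(1, 2))]

def mix(weights, vectors):
    """Value vector of the randomized strategy that picks pure strategy j
    with probability weights[j] (weights must be a distribution)."""
    assert sum(weights) == 1 and all(w >= 0 for w in weights)
    dims = len(vectors[0])
    return tuple(sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(dims))

def dominates(p, q):
    """True if p dominates q: p >= q coordinate-wise and p != q."""
    return p != q and all(a >= b for a, b in zip(p, q))

def pareto_optimal(points):
    """Keep the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Mix a1 and a2 with probability 1/2 each.
half_half = mix([Fr(1, 2), Fr(1, 2), Fr(0)], PURE)
```

Here `half_half` evaluates to (3/10, 2/5), which is dominated by the vector (1/2, 1/2) of action $a_3$: this mixture is achievable but does not lie on the Pareto curve.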

The Pareto curve $\mathcal{P}$ in general contains infinitely many points, and it can be too costly
to compute an exact representation for it (see Section 2). Instead of computing it outright
we can try to approximate it ([PY00]). An $\epsilon$-approximate Pareto curve is a set of achievable
vectors $\mathcal{P}(\epsilon)$ such that for every achievable vector $r$ there is some vector $t \in \mathcal{P}(\epsilon)$ which
"almost" dominates it, meaning $r \le (1 + \epsilon)t$.

In general, given a labeled MDP $M$, $k$ distinct $\omega$-regular properties, $\Phi = \{\varphi_i \mid i = 1, \ldots, k\}$,
a start state $u$, and a strategy $\sigma$, let $\Pr_u^\sigma(\varphi_i)$ denote the probability that starting
at $u$, using strategy $\sigma$, the trajectory satisfies $\varphi_i$. For a strategy $\sigma$, define the vector
$t^\sigma = (t_1^\sigma, \ldots, t_k^\sigma)$, where $t_i^\sigma = \Pr_u^\sigma(\varphi_i)$, for $i = 1, \ldots, k$. We say a value vector $r \in [0,1]^k$ is
achievable for $\Phi$, if there exists a strategy $\sigma$ such that $t^\sigma \ge r$.

We provide an algorithm that given MDP $M$, start state $u$, properties $\Phi$, and rational
value vector $r \in [0,1]^k$, decides whether $r$ is achievable, and if so produces a strategy $\sigma$ such
that $t^\sigma \ge r$. The algorithm runs in time polynomial in the size of the MDP. The strategies
may require both randomization and memory. Our algorithm works by first reducing the
achievability problem for multiple $\omega$-regular properties to one with multiple reachability
objectives, and then reducing the multi-objective reachability problem to a multi-objective
linear programming problem. We also show that one can compute an $\epsilon$-approximate Pareto
curve for $\Phi$ in time polynomial in the size of the MDP and in $1/\epsilon$. To do this, we use
our linear programming characterization for achievability, and use results from [PY00] on
approximating the Pareto curve for multi-objective linear programming problems.

We also consider more general multi-objective queries. Given a boolean combination
$B$ of quantitative predicates of the form $\Pr_u^\sigma(\varphi_i)\,\Delta\,p$, where $\Delta \in \{\le, \ge, <, >, =, \neq\}$, and
$p \in [0,1]$, a multi-objective query asks whether there exists a strategy $\sigma$ satisfying $B$ (or
whether all strategies $\sigma$ satisfy $B$). It turns out that such queries are not really much more
expressive than checking achievability. Namely, checking a fixed query $B$ can be reduced to
checking a fixed number of extended achievability queries, where for some of the coordinates
$t_i^\sigma$ we can ask for a strict inequality, i.e., that $t_i^\sigma > r_i$. (In general, however, the number
and size of the extended achievability queries needed may be exponential in the size of $B$.)
A motivation for allowing general multi-objective queries is to enable assume-guarantee
compositional reasoning for probabilistic systems, as explained in Section 2.

Whereas our algorithms for quantitative problems use LP methods, we also consider
qualitative multi-objective queries. These are queries given by boolean combinations of
predicates of the form $\Pr_u^\sigma(\varphi_i)\,\Delta\,b$, where $b \in \{0, 1\}$. We give an algorithm using purely
graph-theoretic techniques that decides whether there is a strategy that satisfies a qualitative
multi-objective query, and if so produces such a strategy. The algorithm runs in time
polynomial in the size of the MDP. Even for satisfying qualitative queries the strategy may
need to use both randomization and memory.

In typical applications, the MDP is far larger than the size of the query. Also,
$\omega$-regular properties can be presented in many ways, and it was already shown in [CY95]
that the query complexity of model checking MDPs against even a single LTL property is
2EXPTIME-complete. We remark here that, if properties are expressed via LTL formulas,
then our algorithms run in polynomial time in the size of the MDP and in 2EXPTIME in
the size of the query, for deciding arbitrary multi-objective queries, where both the MDP
and the query are part of the input. So, the worst-case upper bound is the same as with
a single LTL objective. However, to keep our complexity analysis simple, we focus in this
paper on the model complexity of our algorithms, rather than their query complexity or
combined complexity.

Related work. Model checking of MDPs with a single $\omega$-regular objective has been studied
in detail (see [CY98, CY95, Var85]). In [CY98], Courcoubetis and Yannakakis also considered
MDPs with a single objective given by a positive weighted sum of the probabilities of
multiple $\omega$-regular properties, and they showed how to efficiently optimize such objectives
for MDPs. They did not consider tradeoffs between multiple $\omega$-regular objectives. We
employ and build on techniques developed in [CY98].

Multi-objective optimization is a subject of intensive study in Operations Research
and related fields (see, e.g., [Ehr05, Cli97]). Approximating the Pareto curve for general
multi-objective optimization problems was considered by Papadimitriou and Yannakakis in
[PY00]. Among other results, [PY00] showed that for multi-objective linear programming
(i.e., linear constraints and multiple linear objectives), one can compute a (polynomial sized)
$\epsilon$-approximate Pareto curve in time polynomial in the size of the LP and in $1/\epsilon$.

Our work is related to recent work by Chatterjee, Majumdar, and Henzinger ([CMH06]),
who considered MDPs with multiple discounted reward objectives. They showed that
randomized but memoryless strategies suffice for obtaining any achievable value vector for these
objectives, and they reduced the multi-objective optimization and achievability (what they
call Pareto realizability) problems for MDPs with discounted rewards to multi-objective
linear programming. They were thus able to apply the results of [PY00] in order to
approximate the Pareto curve for this problem. We work in an undiscounted setting, where
objectives can be arbitrary $\omega$-regular properties. In our setting, strategies may require both
randomization and memory in order to achieve a given value vector. As described earlier,
our algorithms first reduce multi-objective $\omega$-regular problems to multi-objective
reachability problems, and we then solve multi-objective reachability problems by reducing them
to multi-objective LP. For multi-objective reachability, we show randomized memoryless
strategies do suffice. Our LP methods for multi-objective reachability are closely related
to the LP methods used in [CMH06] (and see also, e.g., [Put94], Theorem 6.9.1, where a
related result about discounted MDPs is established). However, in order to establish the
results in our undiscounted setting, even for reachability we have to overcome some new
obstacles that do not arise in the discounted case. In particular, whereas the "discounted
frequencies" used in [CMH06] are always well-defined finite values under all strategies, the
analogous undiscounted frequencies or "expected number of visits" can in general be infinite
for an arbitrary strategy. This forces us to preprocess the MDPs in such a way that ensures
that a certain family of undiscounted stochastic flow equations has a finite solution which
corresponds to the "expected number of visits" at each state-action pair under a given
(memoryless) strategy. It also forces us to give a quite different proof that memoryless
strategies suffice to achieve any achievable vector for multi-objective reachability, based on
the convexity of the memorylessly achievable set.

Multi-objective MDPs have also been studied extensively in the OR and stochastic
control literature (see, e.g., [Fur80, Whi82, Hen83, Gho90, WT98]). Much of this work is
typically concerned with discounted reward or long-run average reward models, and does not
focus on the complexity of algorithms. None of this work seems to directly imply even our 
result that for multiple reachability objectives checking achievability of a value vector can 
be decided in polynomial time, not to mention the more general results for multi-objective 
model checking. 
2. Basics and background

A finite-state MDP $M = (V, \Gamma, \delta)$ consists of a finite set $V$ of states, an action alphabet
$\Gamma$, and a transition relation $\delta$. Associated with each state $v$ is a set of enabled actions
$\Gamma_v \subseteq \Gamma$. The transition relation is given by $\delta \subseteq V \times \Gamma \times [0,1] \times V$. For each state
$v \in V$, each enabled action $\gamma \in \Gamma_v$, and every state $v' \in V$, we have at most one transition
$(v, \gamma, p_{(v,\gamma,v')}, v') \in \delta$, for some probability $p_{(v,\gamma,v')} \in (0,1]$, such that $\sum_{v' \in V} p_{(v,\gamma,v')} = 1$.
When there is no transition $(v, \gamma, p_{(v,\gamma,v')}, v')$, we may, only for notational convenience,
sometimes assume that there is a probability 0 transition, i.e., that $p_{(v,\gamma,v')} = 0$. (But
such redundant probability 0 transitions are not part of the actual input.) Thus, at each
state, each enabled action determines a probability distribution on the next state. There
are no other transitions, so no transitions on disabled actions. We assume every state $v$ has
some enabled action, i.e., $\Gamma_v \neq \emptyset$, so there are no dead ends. For our complexity analysis,
we assume of course that all probabilities $p_{(v,\gamma,v')}$ are rational. There are other ways to
present MDPs, e.g., by separating controlled and probabilistic nodes into distinct states.
The different presentations are equivalent and efficiently translatable to each other.
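
To make the definition concrete, here is a minimal sketch of one possible encoding of a finite-state MDP, with a check of the well-formedness conditions stated above. The dictionary layout and the names `check_mdp`, `STATES`, `ENABLED`, and `DELTA` are assumptions made for illustration; they are not part of the paper.

```python
from fractions import Fraction as Fr

def check_mdp(states, enabled, delta):
    """Raise ValueError if the triple violates the MDP definition.

    enabled: dict state v -> set of enabled actions (Gamma_v)
    delta:   dict (v, gamma, v') -> rational probability in (0, 1]
    """
    for (v, g, v2), p in delta.items():
        if v not in states or v2 not in states:
            raise ValueError(f"unknown state in transition {(v, g, v2)}")
        if g not in enabled.get(v, set()):
            raise ValueError(f"transition on disabled action {g} at {v}")
        if not 0 < p <= 1:
            raise ValueError(f"probability {p} outside (0, 1]")
    for v in states:
        if not enabled.get(v):
            raise ValueError(f"state {v} is a dead end")
        for g in enabled[v]:
            # Each enabled action must induce a distribution on next states.
            total = sum(p for (a, b, _), p in delta.items() if (a, b) == (v, g))
            if total != 1:
                raise ValueError(f"probabilities for ({v}, {g}) sum to {total}")
    return True

# A hypothetical two-state MDP used only to exercise the check.
STATES = {"s", "t"}
ENABLED = {"s": {"a", "b"}, "t": {"a"}}
DELTA = {
    ("s", "a", "s"): Fr(1, 2),
    ("s", "a", "t"): Fr(1, 2),
    ("s", "b", "t"): Fr(1),
    ("t", "a", "t"): Fr(1),
}
```

Rational probabilities are stored as `Fraction` values so that the sum-to-one test is exact, in line with the assumption that all probabilities are rational.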

A labeled MDP $M = (V, \Gamma, \delta, l)$ has, in addition, a set of propositional predicates
$Q = \{Q_1, \ldots, Q_r\}$ which label the states. We view this as being given by a labelling
function $l : V \to \Sigma$, where $\Sigma = 2^Q$. We define the encoding size of a (labeled) MDP $M$,
denoted by $|M|$, to be the total size required to encode all transitions and their rational
probabilities, where rational values are encoded with numerator and denominator given in
binary, as well as all state labels.

For a labeled MDP $M = (V, \Gamma, \delta, l)$ with a given initial state $u \in V$, which we denote
by $M_u$, runs of $M_u$ are infinite sequences of states $\pi = \pi_0\pi_1\ldots \in V^\omega$, where $\pi_0 = u$ and
for all $i \ge 0$, $\pi_i \in V$ and there is a transition $(\pi_i, \gamma, p, \pi_{i+1}) \in \delta$, for some $\gamma \in \Gamma_{\pi_i}$ and some
probability $p > 0$. Each run induces an $\omega$-word over $\Sigma$, namely $l(\pi) = l(\pi_0)l(\pi_1)\ldots \in \Sigma^\omega$.

A strategy is a function $\sigma : (V\Gamma)^*V \to \mathcal{D}(\Gamma)$, which maps a finite history of play to
a probability distribution on the next action. Here $\mathcal{D}(\Gamma)$ denotes the set of probability
distributions on the set $\Gamma$. Moreover, it must be the case that for all histories $wu$, $\sigma(wu) \in \mathcal{D}(\Gamma_u)$,
i.e., the probability distribution has support only over the actions available at state
$u$. A strategy is pure if $\sigma(wu)$ has support on exactly one action, i.e., with probability
1 a single action is played at every history. A strategy is memoryless (stationary) if the
strategy depends only on the last state, i.e., if $\sigma(wu) = \sigma(w'u)$ for all $w, w' \in (V\Gamma)^*$. If
$\sigma$ is memoryless, we can simply define it as a function $\sigma : V \to \mathcal{D}(\Gamma)$. An MDP $M$ with
initial state $u$, together with a strategy $\sigma$, naturally induces a Markov chain $M_u^\sigma$, whose
states are the histories of play in $M_u$, and such that from state $s = wv$, if $\gamma \in \Gamma_v$, there is
a transition to state $s' = wv\gamma v'$ with probability $\sigma(wv)(\gamma) \cdot p_{(v,\gamma,v')}$. A run $\theta$ in $M_u^\sigma$ is thus
given by a sequence $\theta = \theta_0\theta_1\ldots$, where $\theta_0 = u$ and each $\theta_i \in (V\Gamma)^*V$, for all $i \ge 0$. We
associate to each history $\theta_i = wv$ the label of its last state $v$. In other words, we overload
the notation and define $l(wv) = l(v)$. We likewise associate with each run $\theta$ the $\omega$-word
$l(\theta) = l(\theta_0)l(\theta_1)\ldots$. Suppose we are given $\varphi$, an LTL formula or Büchi automaton, or any
other formalism for expressing an $\omega$-regular language over alphabet $\Sigma$. Let $L(\varphi) \subseteq \Sigma^\omega$
denote the language expressed by $\varphi$. We write $\Pr_u^\sigma(\varphi)$ to denote the probability that a
trajectory $\theta$ of $M_u^\sigma$ satisfies $\varphi$, i.e., that $l(\theta) \in L(\varphi)$. For generality, rather than just
allowing an initial vertex $u$ we allow an initial probability distribution $\alpha \in \mathcal{D}(V)$. Let
$\Pr_\alpha^\sigma(\varphi)$ denote the probability that under strategy $\sigma$, starting with initial distribution $\alpha$,
we will satisfy $\omega$-regular property $\varphi$. These probabilities are well defined because the set of
such runs is Borel measurable (see, e.g., [Var85, CY95]).

As in the introduction, for a $k$-tuple of $\omega$-regular properties $\Phi = \langle\varphi_1, \ldots, \varphi_k\rangle$, given
a strategy $\sigma$, we let $t^\sigma = (t_1^\sigma, \ldots, t_k^\sigma)$, with $t_i^\sigma = \Pr_u^\sigma(\varphi_i)$, for $i = 1, \ldots, k$. For MDP $M$
and starting state $u$, we define the achievable set of value vectors with respect to $\Phi$ to
be $U_{M_u,\Phi} = \{r \in \mathbb{R}^k_{\ge 0} \mid \exists\sigma \text{ such that } t^\sigma \ge r\}$. For a set $U \subseteq \mathbb{R}^k$, we define a subset
$\mathcal{P} \subseteq U$ of it, called the Pareto curve or the Pareto set of $U$, consisting of the set of Pareto
optimal (or Pareto efficient) vectors inside $U$. A vector $v \in U$ is called Pareto optimal
if $\neg\exists v' (v' \in U \wedge v \le v' \wedge v \neq v')$. Thus $\mathcal{P} = \{v \in U \mid v \text{ is Pareto optimal}\}$. We use
$\mathcal{P}_{M_u,\Phi} \subseteq U_{M_u,\Phi}$ to denote the Pareto curve of $U_{M_u,\Phi}$.

It is clear, e.g., from Figure 1, that the Pareto curve is in general an infinite set. In fact,
it follows from our results that for general w-regular objectives the Pareto set is a convex 
polyhedral set. In principle, we may want to compute some kind of exact representation of 
this set by, e.g., enumerating all the vertices (on the upper envelope) of the polytope that 
defines the Pareto curve, or enumerating the facets that define it. It is not possible to do 
this in polynomial-time in general. In fact, the following theorem holds: 

Theorem 2.1. There is a family of MDPs, $\langle M(n) \mid n \in \mathbb{N}\rangle$, where $M(n)$ has $n$ states and
size $O(n)$, such that for $M(n)$ the Pareto curve for two reachability objectives, $\Diamond P_1$ and
$\Diamond P_2$, contains $n^{\Omega(\log n)}$ vertices (and thus $n^{\Omega(\log n)}$ facets).

Proof. We will adapt and build on a known construction for the bi-objective shortest path
problem which shows that the Pareto curve for that problem can have $n^{\Omega(\log n)}$ vertices.
This was shown in [Car83] and a simplified proof (using a similar construction) was given in
[MS01]. (The constructions and theorems there are phrased in terms of parametric shortest
paths, but these are equivalent to bi-objective shortest paths.) What those constructions
show is that, for some polynomial $f$, and for every $n$, there is a graph $G_n$ with $f(n)$ nodes
and distinguished nodes $s$ and $t$, and such that every edge $(u,v)$ has two (positive) costs
$c(u,v)$ and $d(u,v)$, which yield two cost functions $c(\cdot)$ and $d(\cdot)$ on the $s$-$t$ paths, such that
the Pareto curve of the $s$-$t$ paths under the two objectives has $n^{\Omega(\log n)}$ vertices (and edges).
An important property of the constructed graphs $G_n$ is that they are acyclic and layered,
that is, the nodes are arranged in layers $L_0 = s, L_1, L_2, \ldots, L_n = t$, and all edges are from
layer $L_i$ to $L_{i+1}$ for some $i \in \{0, \ldots, n-1\}$.

Building on this construction, we now construct the following instance $M_n$ of the MDP
problem with two reachability objectives. The states of $M_n$ are the same as $G_n$ with 2 extra
absorbing states: the red state $R$, and the blue state $B$, which are the two target states of
our two reachability objectives. For each state $u$ there is one action for each outgoing edge
$(u,v)$; if we choose this action then we transition with probability $r(u,v)$ to state $R$, with
probability $b(u,v)$ to $B$, with probability 1/2 to $v$, and with the remaining probability to
$t$. The probabilities $r(u,v)$ and $b(u,v)$ are defined as follows. Let $h$ be the maximum $c$ or
$d$ cost over all the edges. For an edge $(u,v)$ where $u \in L_i$ (and $v \in L_{i+1}$), set

$$r(u,v) = \frac{2^i\,(2h - c(u,v))}{8h2^n}$$

and

$$b(u,v) = \frac{2^i\,(2h - d(u,v))}{8h2^n}.$$

Note that both these quantities are in the interval [0,1/4], so all probabilities are
well-defined.






The claim is that there is a 1-1 correspondence between the vertices of the Pareto curve
of this MDP $M_n$ and the Pareto curve of the bi-objective shortest path on $G_n$. First we note
that the vertices of the Pareto curve for the MDP correspond to pure memoryless strategies
(meaning that for each vertex of the Pareto curve a pure memoryless strategy can achieve
the value vector that the vertex defines). The reason for this is that the vertices are optima
for a linear combination of the two objectives, and it follows from the proof of Theorem 3.2,
which we shall show later, that these objectives have pure memoryless optimal strategies.

A pure strategy corresponds to a path from $s$ to $t$. Let $\pi = s, u_1, u_2, \ldots, u_{n-1}, t$ be such a
path/strategy. The probability that this strategy leads to the red node $R$ is $r(s, u_1) + \ldots +
\mathrm{Prob}(\text{reach node } u_i) \cdot r(u_i, u_{i+1}) + \ldots$ The probability that the process reaches node $u_i$
under the strategy $\pi$ is $1/2^i$, independent of the path. Thus, $\mathrm{Prob}_\pi(\text{reach } R) = a - b \cdot c(\pi)$,
where $a$, $b$ are constants independent of the path. Similarly, $\mathrm{Prob}_\pi(\text{reach } B) = a - b \cdot d(\pi)$.

It follows that minimizing the c and d costs of the paths is equivalent to maximizing the 
probabilities of reaching R and B, and this also holds for any positive linear combination of 
the two respective objectives. Thus, there is a correspondence between their Pareto curves. 

□ 
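
The linear relation used at the end of the proof can be verified with exact arithmetic. The sketch below is an illustration under stated assumptions: a path is given only by its list of edge costs $c(u_i, u_{i+1})$, the constants $a = n/(4 \cdot 2^n)$ and $b = 1/(8h2^n)$ are derived here from the formula for $r(u,v)$ and are not stated explicitly in the text, and the function names are hypothetical.

```python
from fractions import Fraction as Fr

def reach_red_probability(path_costs, h):
    """Probability of reaching R along a pure path strategy.

    path_costs[i] is the c-cost of the edge leaving layer L_i, so the
    path has n = len(path_costs) edges; h bounds every edge cost.
    """
    n = len(path_costs)
    total = Fr(0)
    for i, c in enumerate(path_costs):
        r = Fr(2**i * (2 * h - c), 8 * h * 2**n)  # r(u_i, u_{i+1})
        assert 0 <= r <= Fr(1, 4)                 # matches the stated interval
        total += Fr(1, 2**i) * r                  # node u_i is reached w.p. 1/2^i
    return total

def linear_form(path_costs, h):
    """The closed form a - b * c(pi) with a = n/(4*2^n), b = 1/(8*h*2^n)."""
    n = len(path_costs)
    a = Fr(n, 4 * 2**n)
    b = Fr(1, 8 * h * 2**n)
    return a - b * sum(path_costs)

costs = [3, 1, 4]   # hypothetical c-costs of a 3-edge path
h = 5               # hypothetical maximum edge cost
```

Since each term $2^{-i} \cdot r(u_i, u_{i+1})$ equals $(2h - c(u_i,u_{i+1}))/(8h2^n)$, the two functions agree on every path, which is why minimizing $c(\pi)$ maximizes the probability of reaching $R$.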

So, the Pareto curve is in general a polyhedral surface of superpolynomial size, and
thus cannot be constructed exactly in polynomial time. We show, however, that the Pareto
set can be efficiently approximated to any desired accuracy $\epsilon > 0$. An $\epsilon$-approximate Pareto
curve, $\mathcal{P}_{M_u,\Phi}(\epsilon) \subseteq U_{M_u,\Phi}$, is any achievable set such that $\forall r \in U_{M_u,\Phi}\ \exists t \in \mathcal{P}_{M_u,\Phi}(\epsilon)$ such
that $r \le (1 + \epsilon)t$. When the subscripts $M_u$ and $\Phi$ are clear from the context, we will drop
them and use $U$, $\mathcal{P}$, and $\mathcal{P}(\epsilon)$ to denote the achievable set, Pareto set, and $\epsilon$-approximate
Pareto set, respectively.
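
The covering condition $r \le (1+\epsilon)t$ can be tested directly on finite data. The following sketch is only an illustration: the true definition quantifies over the whole (infinite) achievable set, whereas this code checks a finite sample of achievable vectors, and the names `almost_dominates` and `covers` as well as the sample data are assumptions.

```python
from fractions import Fraction as Fr

def almost_dominates(t, r, eps):
    """True if r <= (1 + eps) * t holds coordinate-wise."""
    return all(ri <= (1 + eps) * ti for ri, ti in zip(r, t))

def covers(candidate_set, sample, eps):
    """True if every sampled vector is almost dominated by some candidate."""
    return all(any(almost_dominates(t, r, eps) for t in candidate_set)
               for r in sample)

# Candidates: the three pure-strategy vectors of the Figure 1 example.
candidates = [(Fr(3, 5), Fr(0)), (Fr(1, 2), Fr(1, 2)), (Fr(0), Fr(4, 5))]
# Sample: the candidates plus the even mixture of actions a1 and a3.
sample = candidates + [(Fr(11, 20), Fr(1, 4))]
```

With $\epsilon = 1/10$ the mixture (11/20, 1/4) is almost dominated by (1/2, 1/2), so the three candidates cover this sample; with $\epsilon = 0$ they do not.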

We also consider general multi-objective queries. A quantitative predicate over $\omega$-regular
property $\varphi_i$ is a statement of the form $\Pr_u^\sigma(\varphi_i)\,\Delta\,p$, for some rational probability $p \in [0,1]$,
and where $\Delta$ is a comparison operator $\Delta \in \{\le, \ge, <, >, =\}$. Suppose $B$ is a boolean
combination over such predicates. Then, given $M$ and $u$, and $B$, we can ask whether there
exists a strategy $\sigma$ such that $B$ holds, or whether $B$ holds for all $\sigma$. Note that since $B$ can
be put in DNF form, and the quantification over strategies pushed into the disjunction, and
since $\omega$-regular languages are closed under complementation, any query of the form $\exists\sigma B$ (or
of the form $\forall\sigma B$) can be transformed to a disjunction (a negated disjunction, respectively)
of queries of the form:

$$\exists\sigma\ \bigwedge_i (\Pr_u^\sigma(\varphi_i) \ge r_i) \wedge \bigwedge_j (\Pr_u^\sigma(\varphi'_j) > r'_j) \qquad (2.1)$$

We call queries of the form (2.1) extended achievability queries. Thus, if the
multi-objective query is fixed, it suffices to perform a fixed number of extended achievability
queries to decide any multi-objective query. Note, however, that the number of extended
achievability queries we need could be exponential in the size of $B$. We do not focus on
optimizing query complexity in this paper.

A motivation for allowing general multi-objective queries is to enable assume-guarantee
compositional reasoning for probabilistic systems. Consider, e.g., a probabilistic system
consisting of the concurrent composition of two components, $M_1$ and $M_2$, where output
from $M_1$ provides input to $M_2$ and thus controls $M_2$. We denote this by $M_1 \rhd M_2$. $M_2$
itself may generate outputs for some external device, and $M_1$ may also be controlled by
external inputs. (One can also consider symmetric composition, where outputs from both
components provide inputs to both. Here, for simplicity, we restrict ourselves to asymmetric
composition where $M_1$ controls $M_2$.) Let $M$ be an MDP with separate input and output
action alphabets $\Sigma_1$ and $\Sigma_2$, and let $\varphi_1$ and $\varphi_2$ denote $\omega$-regular properties over these two
alphabets, respectively. We write $\langle\varphi_1\rangle_{\ge r_1} M \langle\varphi_2\rangle_{\ge r_2}$ to denote the assertion that "if the
input controller of $M$ satisfies $\varphi_1$ with probability $\ge r_1$, then the output generated by $M$
satisfies $\varphi_2$ with probability $\ge r_2$". Using this, we can formulate a general compositional
assume-guarantee proof rule:

$$\frac{\langle\varphi_1\rangle_{\ge r_1}\, M_1\, \langle\varphi_2\rangle_{\ge r_2} \qquad \langle\varphi_2\rangle_{\ge r_2}\, M_2\, \langle\varphi_3\rangle_{\ge r_3}}{\langle\varphi_1\rangle_{\ge r_1}\, M_1 \rhd M_2\, \langle\varphi_3\rangle_{\ge r_3}}$$

Thus, to check $\langle\varphi_1\rangle_{\ge r_1} M_1 \rhd M_2 \langle\varphi_3\rangle_{\ge r_3}$ it suffices to check two properties of smaller systems:
$\langle\varphi_1\rangle_{\ge r_1} M_1 \langle\varphi_2\rangle_{\ge r_2}$ and $\langle\varphi_2\rangle_{\ge r_2} M_2 \langle\varphi_3\rangle_{\ge r_3}$. Note that checking $\langle\varphi_1\rangle_{\ge r_1} M \langle\varphi_2\rangle_{\ge r_2}$ amounts
to checking that there does not exist a strategy $\sigma$ controlling $M$ such that $\Pr_u^\sigma(\varphi_1) \ge r_1$
and $\Pr_u^\sigma(\varphi_2) < r_2$.

We also consider qualitative multi-objective queries. These are queries restricted so that
$B$ contains only qualitative predicates of the form $\Pr_u^\sigma(\varphi_i)\,\Delta\,b$, where $b \in \{0, 1\}$. These can,
e.g., be used to check qualitative assume-guarantee conditions of the form: $\langle\varphi_1\rangle_{\ge 1} M \langle\varphi_2\rangle_{\ge 1}$.
It is not hard to see that again, via boolean manipulations and complementation of
automata, we can convert any qualitative query to a number of queries of the form:

$$\exists\sigma\ \bigwedge_{\varphi \in \Phi} (\Pr_u^\sigma(\varphi) = 1) \wedge \bigwedge_{\psi \in \Psi} (\Pr_u^\sigma(\psi) > 0)$$

where $\Phi$ and $\Psi$ are sets of $\omega$-regular properties. It thus suffices to consider only these
qualitative queries.

In the next sections we study how to decide various classes of multi-objective queries,
and how to approximate the Pareto curve for properties $\Phi$. Let us observe here a difficulty
that we will have to deal with. Namely, in general we will need both randomization and
memory in our strategies in order to satisfy even simple qualitative multi-objective queries.
Consider the MDP, $M'$, shown in Figure 2, and consider the conjunctive query: $B \equiv
\Pr_u^\sigma(\Box\Diamond P_1) > 0 \wedge \Pr_u^\sigma(\Box\Diamond P_2) > 0$. It is not hard to see that starting at state $u$ in $M'$
any strategy $\sigma$ that satisfies $B$ must use both memory and randomization. Each predicate
in $B$ can be satisfied in isolation (in fact with probability 1), but, with a memoryless or
deterministic strategy, if we try to satisfy $\Box\Diamond P_2$ with non-zero probability, we will be forced
to satisfy $\Box\Diamond P_1$ with probability 0. Note, however, that we can satisfy both with probability
$> 0$ using a strategy that uses both memory and randomness: namely, upon reaching the
state labeled $P_1$ for the first time, with probability 1/2 we use move $a$ and with probability
1/2 we use move $b$. Thereafter, upon encountering the state labeled $P_1$ for the $n$th time,
$n \ge 2$, we deterministically pick action $a$. This clearly assures that both predicates are
satisfied with probability $= 1/2 > 0$.

Figure 2: The MDP $M'$.
We note that our results (combined with the earlier results of [CY98]) imply that
for general multi-objective queries a randomized strategy with a finite amount of memory
(which depends on the MDP and query) does suffice to satisfy any satisfiable quantitative
multi-objective $\omega$-regular query.

3. Multi-objective reachability

In this section, as a step towards quantitative multi-objective model checking problems,
we study a simpler multi-objective reachability problem. Specifically, we are given an MDP,
$M = (V, \Gamma, \delta)$, a starting state $u$, and a collection of target sets $F_i \subseteq V$, $i = 1, \ldots, k$. The
sets $F_i$ may overlap. We have $k$ objectives: the $i$-th objective is to maximize the probability
of $\Diamond F_i$, i.e., of reaching some state in $F_i$. We assume that the states $F = \bigcup_{i=1}^k F_i$ are all
absorbing states with a self-loop. In other words, for all $v \in F$, $(v, a, 1, v) \in \delta$ and $\Gamma_v = \{a\}$.
(The assumption that target states are absorbing is necessary for the proofs in this section,
but it is not a restriction in general for our results. It will follow from the model checking
results in Section 5, which build on this section, that multi-objective reachability problems
for arbitrary target states (whether absorbing or not) can also be handled with the same
complexities.)

We first need to do some preprocessing on the MDP, to remove some useless states.
For each state $v \in V \setminus F$ we can check easily whether there exists a strategy $\sigma$ such that
$\Pr_v^\sigma(\Diamond F) > 0$: this just amounts to checking whether there exists a path from $v$ to $F$ in
the underlying graph of the MDP, i.e., the graph given by considering only the non-zero-
probability transitions. Let us call a state that does not satisfy this property a bad state.
Clearly, for the purposes of optimizing reachability objectives, we can compute and remove
all bad states from an MDP. Thus, it is safe to assume that bad states do not exist.¹ Let
us call an MDP with goal states $F$ cleaned-up if it does not contain any bad states.
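
The preprocessing step is a plain backward reachability computation on the underlying graph. Here is a minimal sketch, assuming the graph is given as a list of edges $(v, v')$ for the transitions with non-zero probability; the function name `bad_states` and the example data are hypothetical.

```python
from collections import deque

def bad_states(states, edges, targets):
    """Return the states from which no path leads to a target state.

    edges: iterable of (v, v2) pairs, one per non-zero-probability
    transition (the action taken is irrelevant for this check).
    """
    predecessors = {v: set() for v in states}
    for v, v2 in edges:
        predecessors[v2].add(v)
    # Breadth-first search backwards from the target set F.
    can_reach = set(targets)
    queue = deque(targets)
    while queue:
        v = queue.popleft()
        for p in predecessors[v]:
            if p not in can_reach:
                can_reach.add(p)
                queue.append(p)
    return set(states) - can_reach

# Hypothetical example: "d" only loops on itself, so it is bad.
STATES = {"s", "w", "d", "f"}
EDGES = [("s", "w"), ("w", "f"), ("s", "d"), ("d", "d"), ("f", "f")]
```

The search visits each edge at most once, so the cleaned-up MDP can be obtained in time linear in the size of the underlying graph.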

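Proposition 3.1. For a cleaned-up MDP, an initial distribution $\alpha \in \mathcal{D}(V \setminus F)$, and a
vector of probabilities $r \in [0,1]^k$, there exists a (memoryless) strategy $\sigma$ such that

$$\bigwedge_{i=1}^{k} \Pr_\alpha^\sigma(\Diamond F_i) \ge r_i$$

if and only if there exists a (respectively, memoryless) strategy $\sigma'$ such that

$$\bigwedge_{i=1}^{k} \Pr_\alpha^{\sigma'}(\Diamond F_i) \ge r_i \ \wedge \bigwedge_{v \in V} \Pr_v^{\sigma'}(\Diamond F) > 0.$$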

Proof. This is quite obvious, but we give a quick argument anyway. Suppose we have such
a strategy $\sigma$. Since the MDP is cleaned-up, we know that from every state in $V$ we can
reach $F$ with a positive probability. Suppose the strategy leads to a history whose last state
is $v \in V \setminus F$, and that thereafter the strategy is such that it will never reach $F$ on any path.
We simply revise $\sigma$ to a strategy $\sigma'$ such that, if we ever arrive at such a "dead" history, we
switch and play according to the memoryless strategy starting at $v$ which reaches $F$ with
some positive probability. Note that if $\sigma$ is memoryless then so is $\sigma'$. □

¹ Technically, we would need to install a new "dead" absorbing state $v_{dead} \notin F$, such that all the
probabilities going into states that have been removed now go to $v_{dead}$. For convenience in notation, instead of
explicitly adding $v_{dead}$ we treat it as implicit: we allow that for some states $v \in V$ and some action $\gamma \in \Gamma_v$
we have $\sum_{v' \in V} p_{(v,\gamma,v')} < 1$, and we implicitly assume that there is an "invisible" transition to $v_{dead}$ with
the residual probability, i.e., with $p_{(v,\gamma,v_{dead})} = 1 - \sum_{v' \in V} p_{(v,\gamma,v')}$. Of course, $v_{dead}$ would then be a "bad"
state, but we can ignore this implicit state.

Objectives ($i = 1, \ldots, k$): Maximize $\sum_{v \in F_i} y_v$;
Subject to:

$\sum_{\gamma \in \Gamma_v} y_{(v,\gamma)} - \sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} p_{(v',\gamma',v)}\, y_{(v',\gamma')} = \alpha(v)$ for all $v \in V \setminus F$;
$y_v - \sum_{v' \in V \setminus F} \sum_{\gamma' \in \Gamma_{v'}} p_{(v',\gamma',v)}\, y_{(v',\gamma')} = 0$ for all $v \in F$;
$y_v \ge 0$ for all $v \in F$;
$y_{(v,\gamma)} \ge 0$ for all $v \in V \setminus F$ and $\gamma \in \Gamma_v$.

Figure 3: Multi-objective LP for the multi-objective MDP reachability problem



Now, consider the multi-objective LP described in Figure 3.² The set of variables in this
LP are as follows: for each $v \in F$, there is a variable $y_v$, and for each $v \in V \setminus F$ and each
$\gamma \in \Gamma_v$, there is a variable $y_{(v,\gamma)}$.
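
To illustrate how the constraints of Figure 3 are read, the following sketch checks whether a given assignment to the variables is feasible. It is a checker, not an LP solver, and it rests on assumptions: the dictionary encoding, the name `lp_feasible`, and the small MDP below (one non-target state `s` with three actions, modelled on the Figure 1 example, with the missing probability mass going to the implicit dead state) are all illustrative choices.

```python
from fractions import Fraction as Fr

def lp_feasible(nontarget, targets, trans, alpha, y_sa, y_t):
    """Check the constraints of the multi-objective LP.

    trans: dict (v, gamma) -> dict v2 -> probability, for v in V \\ F
    y_sa:  dict (v, gamma) -> value of y_(v,gamma)
    y_t:   dict target v   -> value of y_v
    Target states are absorbing, so only the y_(v,gamma) variables
    contribute incoming flow.
    """
    if any(val < 0 for val in y_sa.values()):
        return False
    if any(val < 0 for val in y_t.values()):
        return False

    def inflow(v):
        return sum(trans[key].get(v, 0) * val for key, val in y_sa.items())

    for v in nontarget:
        outflow = sum(val for (u, _), val in y_sa.items() if u == v)
        if outflow - inflow(v) != alpha.get(v, 0):
            return False
    return all(y_t[v] == inflow(v) for v in targets)

# Hypothetical encoding of the Figure 1 example.
TRANS = {
    ("s", "a1"): {"P1": Fr(3, 5)},
    ("s", "a2"): {"P2": Fr(4, 5)},
    ("s", "a3"): {"P1": Fr(1, 2), "P2": Fr(1, 2)},
}
ALPHA = {"s": Fr(1)}
# Play a1 and a3 with probability 1/2 each.
Y_SA = {("s", "a1"): Fr(1, 2), ("s", "a2"): Fr(0), ("s", "a3"): Fr(1, 2)}
Y_T = {"P1": Fr(11, 20), "P2": Fr(1, 4)}
```

In this example the objective values are $y_{P_1} = 11/20$ and $y_{P_2} = 1/4$, so the assignment witnesses, for instance, the value vector $r = (1/2, 1/4)$.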

Theorem 3.2. Suppose we are given a cleaned-up MDP, $M = (V, \Gamma, \delta)$, with multiple target
sets $F_i \subseteq V$, $i = 1, \ldots, k$, where every target $v \in F = \bigcup_{i=1}^k F_i$ is an absorbing state. Let
$\alpha \in \mathcal{D}(V \setminus F)$ be an initial distribution (in particular $V \setminus F \neq \emptyset$). Let $r \in (0,1]^k$ be a vector
of positive probabilities. Then the following are all equivalent:

(1.) There is a (possibly randomized) memoryless strategy $\sigma$ such that

$$\bigwedge_{i=1}^{k} (\Pr_\alpha^\sigma(\Diamond F_i) \ge r_i)$$

(2.) There is a feasible solution $y'$ for the multi-objective LP in Fig. 3 such that

$$\bigwedge_{i=1}^{k} \Big(\sum_{v \in F_i} y'_v \ge r_i\Big)$$

(3.) There is an arbitrary strategy $\sigma$ such that

$$\bigwedge_{i=1}^{k} (\Pr_\alpha^\sigma(\Diamond F_i) \ge r_i)$$

Proof.

(1.) $\Rightarrow$ (2.). Since the MDP is cleaned up, by Proposition 3.1 we can assume there is a
memoryless strategy $\sigma$ such that $\bigwedge_{i=1}^{k} \Pr_\alpha^\sigma(\Diamond F_i) \ge r_i$ and $\forall v \in V\ \Pr_v^\sigma(\Diamond F) > 0$. Consider
the square matrix $P^\sigma$ whose size is $|V \setminus F| \times |V \setminus F|$, and whose rows and columns are indexed
by states in $V \setminus F$. The $(v, v')$'th entry of $P^\sigma$, $P^\sigma_{v,v'}$, is the probability that starting in state
$v$ we shall in one step end up in state $v'$. In other words, $P^\sigma_{v,v'} = \sum_{\gamma \in \Gamma_v} \sigma(v)(\gamma) \cdot p_{(v,\gamma,v')}$.

For all $v \in V \setminus F$ and $\gamma \in \Gamma_v$, let $y'_{(v,\gamma)} = \sum_{v' \in V \setminus F} \alpha(v') \sum_{n=0}^{\infty} (P^\sigma)^n_{v',v}\, \sigma(v)(\gamma)$. In other words, $y'_{(v,\gamma)}$
denotes the "expected number of times that, using the strategy $\sigma$, starting in the distribution
$\alpha$, we will visit the state $v$ and upon doing so choose action $\gamma$". We don't know yet that these
are finite values, but assuming they are, for $v \in F$, let $y'_v = \sum_{v' \in V \setminus F} \sum_{\gamma' \in \Gamma_{v'}} p_{(v',\gamma',v)}\, y'_{(v',\gamma')}$.
This completes the definition of the entire vector $y'$.

² We mention without further elaboration that this LP can be derived, using complementary slackness,
from the dual LP of the standard LP for single-objective reachability obtained from Bellman's optimality
equations, whose variables are $x_v$, for $v \in V$, and whose unique optimal solution is the vector $x^*$ with
$x^*_v = \max_\sigma \Pr_v^\sigma(\Diamond F)$ (see, e.g., [Put94, CY98]).

Lemma 3.3. The vector $y'$ is well defined (i.e., all entries $y'_{(v,\gamma)}$ are finite).
Moreover, $y'$ is a feasible solution to the constraints of the LP in Figure 3.

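Proof. First, we show that for all $v \in V \setminus F$ and $\gamma \in \Gamma_v$, $y'_{(v,\gamma)}$ is a well defined finite value.
It then also follows from the definition of $y'_v$ that $y'_v$ is also finite and thus that the vector
$y'$ is well defined. Note that because $\sigma$ has the property that $\forall v \in V$: $\Pr^{\sigma}_{v}(\Diamond F) > 0$, $P^{\sigma}$ is
clearly a substochastic matrix with the property that, for some power $d \ge 1$, all of the row
sums of $(P^{\sigma})^d$ are strictly less than 1. Thus, it follows that $\lim_{n \to \infty} (P^{\sigma})^n = 0$, and thus
by standard facts about matrices the inverse matrix $(I - P^{\sigma})^{-1} = \sum_{n=0}^{\infty} (P^{\sigma})^n$ exists and
is non-negative. Now observe that

$$\begin{aligned}
y'_{(v,\gamma)} &= \sum_{v' \in V \setminus F} \alpha(v') \sum_{n=0}^{\infty} (P^{\sigma})^n_{v',v}\, \sigma(v)(\gamma) \\
&= \sigma(v)(\gamma) \sum_{v' \in V \setminus F} \alpha(v') \sum_{n=0}^{\infty} (P^{\sigma})^n_{v',v} \\
&= \sigma(v)(\gamma) \sum_{v' \in V \setminus F} \alpha(v')\, (I - P^{\sigma})^{-1}_{v',v}
\end{aligned}$$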

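Next, we show that $y'$ is a feasible solution to the constraints in the multi-objective LP in
Figure 3. Note that, for each state $v \in V \setminus F$, the expression $\sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')}$
is precisely the "expected number of times we will take a transition into the state $v$" if
we start at initial distribution $\alpha$ and use strategy $\sigma$, whereas $\sum_{\gamma \in \Gamma_v} y'_{(v,\gamma)}$ defines
precisely the "expected number of times we will take a transition out of the state $v$". Thus
$\alpha(v)$, the probability that we will start in state $v$, is precisely given by
$\sum_{\gamma \in \Gamma_v} y'_{(v,\gamma)} - \sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')}$. More formally, for each state $v \in V \setminus F$:

$$\begin{aligned}
\sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')}
&= \sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v) \sum_{v'' \in V \setminus F} \alpha(v'') \sum_{n=0}^{\infty} (P^{\sigma})^n_{v'',v'}\, \sigma(v')(\gamma') \\
&= \sum_{v'' \in V \setminus F} \alpha(v'') \sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v) \sum_{n=0}^{\infty} (P^{\sigma})^n_{v'',v'}\, \sigma(v')(\gamma') \\
&= \sum_{v'' \in V \setminus F} \alpha(v'') \sum_{n=1}^{\infty} (P^{\sigma})^n_{v'',v}
\end{aligned}$$

The last expression is easily seen to be the expected number of times we will transition
into state $v$. It is clear by linearity of expectations that $\sum_{\gamma \in \Gamma_v} y'_{(v,\gamma)}$ is the expected
number of times we will transition out of state $v$. It is thus clear that
$\sum_{\gamma \in \Gamma_v} y'_{(v,\gamma)} - \sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')} = \alpha(v)$. □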
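
The construction of $y'$ from a memoryless strategy can be carried out numerically by solving one linear system instead of summing the series. The sketch below is only an illustration under assumed names (`solve`, `occupation_measure`, and the dictionary encoding of the MDP are not from the paper); it uses exact rational arithmetic and the identity $x^T = \alpha^T (I - P^{\sigma})^{-1}$, where $x_v$ is the expected number of visits to $v$.

```python
from fractions import Fraction as Fr

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def occupation_measure(nontargets, targets, sigma, P, alpha):
    """Return (y_act, y_tgt): y'_(v,a) for non-targets and y'_v for targets.

    Hypothetical encoding: sigma[v] maps actions to probabilities and
    P[(v, a)] maps successors to probabilities.
    """
    n = len(nontargets)
    # P_sigma[v][v2] = sum_a sigma(v)(a) * P(v, a, v2), restricted to V \ F
    Ps = [[sum(p * P[(v, a)].get(v2, 0) for a, p in sigma[v].items())
           for v2 in nontargets] for v in nontargets]
    # x^T (I - P_sigma) = alpha^T, i.e. (I - P_sigma)^T x = alpha
    A = [[Fr(1 if i == j else 0) - Ps[j][i] for j in range(n)] for i in range(n)]
    x = solve(A, [Fr(alpha.get(v, 0)) for v in nontargets])
    y_act = {(v, a): p * x[i]
             for i, v in enumerate(nontargets) for a, p in sigma[v].items()}
    y_tgt = {t: sum(P[(v, a)].get(t, 0) * y for (v, a), y in y_act.items())
             for t in targets}
    return y_act, y_tgt

# Toy MDP (made up): action "a" reaches t1; action "b" reaches t2 or returns to s.
P = {("s", "a"): {"t1": Fr(1)}, ("s", "b"): {"t2": Fr(1, 2), "s": Fr(1, 2)}}
sigma = {"s": {"a": Fr(1, 2), "b": Fr(1, 2)}}
y_act, y_tgt = occupation_measure(["s"], ["t1", "t2"], sigma, P, {"s": Fr(1)})
```

For this strategy the entries of `y_tgt` coincide with the probabilities of reaching `t1` and `t2`, in line with the interpretation of $y'_v$ for $v \in F$ given in the proof.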

Now we argue that $\sum_{v \in F_i} y'_v = \Pr^{\sigma}_{\alpha}(\Diamond F_i)$. To see this, note that for $v \in F$,
$y'_v = \sum_{v' \in V \setminus F} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')}$ is precisely the "expected number of times that we will
transition into state $v$ for the first time", starting at distribution $\alpha$. The reason we can say "for
the first time" is because only the states in $V \setminus F$ are included in the matrix $P^{\sigma}$. But note
that this italicised statement in quotes is another way to define the probability of eventually
reaching state $v$. This equality can be established formally, but we omit the formal algebraic
derivation here. Thus $\sum_{v \in F_i} y'_v = \Pr^{\sigma}_{\alpha}(\Diamond F_i) \ge r_i$. We are done with (1.) $\Rightarrow$ (2.).

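(2.) $\Rightarrow$ (1.). We now wish to show that if $y''$ is a feasible solution to the multi-objective LP
such that $\sum_{v \in F_i} y''_v \ge r_i > 0$, for all $i = 1, \ldots, k$, then there exists a memoryless strategy $\sigma$
such that $\bigwedge_{i=1}^{k} \Pr^{\sigma}_{\alpha}(\Diamond F_i) \ge r_i$.

Suppose we have such a solution $y''$. Let $S = \{ v \in V \setminus F \mid \sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)} > 0 \}$. Let $\sigma$ be
the memoryless strategy, given as follows. For each $v \in S$,

$$\sigma(v)(\gamma) := \frac{y''_{(v,\gamma)}}{\sum_{\gamma' \in \Gamma_v} y''_{(v,\gamma')}}$$

Note that since $\sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)} > 0$, $\sigma(v)$ is a well-defined probability distribution on the
moves at state $v \in S$. For the remaining states $v \in (V \setminus F) \setminus S$, let $\sigma(v)$ be an arbitrary
distribution in $\mathcal{D}(\Gamma_v)$.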
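
The definition of $\sigma$ from $y''$ is a per-state normalisation, which the following sketch spells out. The function name `strategy_from_flow` and the dictionary encoding are assumptions made for illustration; the "arbitrary" choice outside $S$ is fixed here to the first listed action purely for concreteness.

```python
from fractions import Fraction as Fr

def strategy_from_flow(Gamma, y_act):
    """Return (S, sigma) where sigma(v)(a) = y''_(v,a) / sum_a' y''_(v,a') on S.

    Hypothetical encoding: Gamma[v] lists the actions of each non-target state v,
    and y_act[(v, a)] holds the LP value y''_(v,a).
    """
    S, sigma = set(), {}
    for v, actions in Gamma.items():
        z = sum(y_act[(v, a)] for a in actions)
        if z > 0:
            S.add(v)
            sigma[v] = {a: y_act[(v, a)] / z for a in actions}
        else:
            # Outside S any distribution works; picking the first action is an assumption.
            sigma[v] = {actions[0]: Fr(1)}
    return S, sigma

# Made-up LP solution: state s carries flow, state d carries none.
Gamma = {"s": ["a", "b"], "d": ["c"]}
y_act = {("s", "a"): Fr(1, 2), ("s", "b"): Fr(1), ("d", "c"): Fr(0)}
S, sigma = strategy_from_flow(Gamma, y_act)
```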

Lemma 3.4. This memoryless strategy $\sigma$ satisfies $\bigwedge_{i=1}^{k} \Pr^{\sigma}_{\alpha}(\Diamond F_i) \ge r_i$.

Proof. Let us assume, for the sake of convenience in our analysis, that there is an extra
dead-end absorbing state $v_{dead} \notin F$ available, and an extra move $\gamma_{dead}$ available at each
state $v$, with $P(v, \gamma_{dead}, v_{dead}) = 1$. For each $v \in (V \setminus F) \setminus S$, instead of letting $\sigma(v)$ be
arbitrary, let $\sigma(v)(\gamma_{dead}) = 1$. In other words, from each such state we simply move directly
to an absorbing dead-end which is outside of $F$. The assumption that such a dead-end
exists is just for convenience: clearly, without such a dead-end, we can use any (mixed)
move at such vertices in our strategy, and such a strategy would yield at least as high a
value for $\Pr^{\sigma}_{\alpha}(\Diamond F_i)$, for all $i = 1, \ldots, k$.

Let us now explain the reason why we don't care about what moves are used at
states outside $S$ in the strategy $\sigma$. Let $support(\alpha) = \{ v \in V \setminus F \mid \alpha(v) > 0 \}$. We
claim $S$ contains all states reachable from $support(\alpha)$ using strategy $\sigma$. To see this,
first note that $support(\alpha) \subseteq S$, because for all $v \in support(\alpha)$, since
$\sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)} - \sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y''_{(v',\gamma')} = \alpha(v)$ and $\alpha(v) > 0$, and since
$y''_{(v',\gamma')} \ge 0$ for all $v' \in V \setminus F$ and $\gamma' \in \Gamma_{v'}$, it must be the case that
$\sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)} > 0$. Thus $support(\alpha) \subseteq S$. Inductively, for
$k \ge 0$, consider any state $v \in V \setminus F$, such that we can, with non-zero probability, reach $v$ in
$k$ steps using strategy $\sigma$ from a state in $support(\alpha)$, and such that we cannot reach $v$ (with
non-zero probability) in any fewer than $k$ steps. For the base case $k = 0$, we already know
$v \in support(\alpha) \subseteq S$. For $k > 0$, we must have $\alpha(v) = 0$. But note that there must be a
positive probability of moving to $v$ in one step from some other state $v'$ which can be reached in
$k - 1$ steps from $support(\alpha)$. But this is so if and only if for some $\gamma' \in \Gamma_{v'}$, both $P(v',\gamma',v) > 0$
and $y''_{(v',\gamma')} > 0$ (and thus $\sigma(v')(\gamma') > 0$). Hence, $\sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y''_{(v',\gamma')} > 0$. Thus
since $\sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)} - \sum_{v' \in V} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y''_{(v',\gamma')} = 0$, we must have $\sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)} > 0$, and
thus $v \in S$. Hence $S$ contains the set of nodes reachable from nodes in the support of the
initial distribution, $support(\alpha)$, using the strategy $\sigma$.

We will now show that $\Pr^{\sigma}_{\alpha}(\Diamond F_i) \ge r_i$, for all $i = 1, \ldots, k$. Let us consider the underlying
graph of the "flows" defined by $y''$. Namely, let $G = (V, E)$ be a graph on states of $M$ such
that $(v, v') \in E$ if and only if there is some $\gamma \in \Gamma_v$ such that $y''_{(v,\gamma)} > 0$ and $P(v,\gamma,v') > 0$. Let
$W \subseteq V \setminus F$ be the set of vertices in $V \setminus F$ that have a non-zero "flow" to $F$, i.e., $v$ is in $W$
iff there is a path in $G$ from $v$ to some vertex in $F$.
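
The set $W$ can be computed by a backward search in the flow graph $G$. The sketch below is one way to do this under an assumed dictionary encoding of the MDP and of $y''$; the name `states_with_flow_to_targets` is hypothetical.

```python
def states_with_flow_to_targets(P, y_act, F):
    """Return W: non-target states with a path in the flow graph G to some state in F.

    Hypothetical encoding: P[(v, a)] maps successors to probabilities and
    y_act[(v, a)] is the LP value y''_(v,a). An edge (v, v2) is in G iff some
    action a has y''_(v,a) > 0 and P(v, a, v2) > 0.
    """
    pred = {}  # reverse adjacency of G
    for (v, a), flow in y_act.items():
        if flow > 0:
            for v2, p in P[(v, a)].items():
                if p > 0:
                    pred.setdefault(v2, set()).add(v)
    W, stack = set(), list(F)
    while stack:  # backward depth-first search from the targets
        t = stack.pop()
        for v in pred.get(t, ()):
            if v not in W and v not in F:
                W.add(v)
                stack.append(v)
    return W

# Made-up example: s feeds both targets; d has an action towards t1 but carries no flow.
P = {("s", "a"): {"t1": 1.0}, ("s", "b"): {"t2": 0.5, "s": 0.5}, ("d", "c"): {"t1": 1.0}}
y_act = {("s", "a"): 0.5, ("s", "b"): 1.0, ("d", "c"): 0.0}
W = states_with_flow_to_targets(P, y_act, {"t1", "t2"})
```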

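For $v \in V \setminus F$, let $z_v = \sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)}$. Note that by the constraints of the LP, for any
vertex $v \in S$:

$$\begin{aligned}
\alpha(v) &= z_v - \sum_{v' \in V \setminus F} \sum_{\gamma \in \Gamma_{v'}} P(v',\gamma,v)\, y''_{(v',\gamma)} \\
&= z_v - \sum_{v' \in S} \sum_{\gamma \in \Gamma_{v'}} P(v',\gamma,v)\, y''_{(v',\gamma)} \quad \text{(because all flow into $v$ comes from $S$)} \\
&= z_v - \sum_{v' \in S} \sum_{\gamma \in \Gamma_{v'}} P(v',\gamma,v)\, \sigma(v')(\gamma)\, z_{v'} \\
&= z_v - \sum_{v' \in S} P^{\sigma}_{v',v}\, z_{v'}
\end{aligned}$$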

Now, let us focus on the vertices in $W$. Note that, by definition, $W \subseteq S$. Consider the
submatrix $P^{\sigma}_{W,W}$ obtained from $P^{\sigma}$ by eliminating the rows and columns whose indices are
not in $W$. Note that since there is no flow into a vertex in $W$ from a vertex outside of $W$,
the above equalities yield, for each $v \in W$, $\alpha(v) = z_v - \sum_{v' \in W} P^{\sigma}_{v',v}\, z_{v'}$. This can be written
in matrix notation as $\alpha^T|_W = z^T|_W (I - P^{\sigma}_{W,W})$.

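Now, note that since every vertex in $W$ has a "flow" to $F$, in terms of the underlying
Markov chain of the substochastic matrix $P^{\sigma}_{W,W}$, this means that every vertex in $W$ is
transient, and that there is a power $d \ge 1$, such that $(P^{\sigma}_{W,W})^d$ has the property that all
its row sums are strictly less than 1. Consequently, $\lim_{n \to \infty} (P^{\sigma}_{W,W})^n = 0$, and the matrix
$(I - P^{\sigma}_{W,W})$ is invertible, with $(I - P^{\sigma}_{W,W})^{-1} = \sum_{n=0}^{\infty} (P^{\sigma}_{W,W})^n$, a nonnegative matrix. Thus,
$z^T|_W = \alpha^T|_W (I - P^{\sigma}_{W,W})^{-1} = \alpha^T|_W \left( \sum_{n=0}^{\infty} (P^{\sigma}_{W,W})^n \right)$. From this it follows, again because no vertex
outside of $W$ has a flow into $W$, that for each $v \in W$:

$$\begin{aligned}
z_v &= \sum_{v' \in W} \sum_{n=0}^{\infty} \alpha(v')\, (P^{\sigma}_{W,W})^n_{v',v} \sum_{\gamma \in \Gamma_v} \sigma(v)(\gamma) \\
&= \sum_{\gamma \in \Gamma_v} \sum_{v' \in V \setminus F} \sum_{n=0}^{\infty} \alpha(v')\, (P^{\sigma})^n_{v',v}\, \sigma(v)(\gamma) \\
&\qquad \text{(because all moves into $v$ of strategy $\sigma$ come from vertices in $W$)} \\
&= \sum_{\gamma \in \Gamma_v} y'_{(v,\gamma)}
\end{aligned}$$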

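where, in the last expression, the values $y'_{(v,\gamma)}$, not to be mistaken with $y''_{(v,\gamma)}$, are values
from the vector $y'$ which we obtained in the proof that (1.) $\Rightarrow$ (2.), from a given memoryless
strategy $\sigma$. In this case, the strategy $\sigma$ in question is precisely the memoryless strategy we
just defined based on $y''$. Thus, for all $v \in W$:

$$z_v = \sum_{\gamma \in \Gamma_v} y''_{(v,\gamma)} = \sum_{\gamma \in \Gamma_v} y'_{(v,\gamma)} \qquad (3.1)$$

We next show that in fact for all $v \in W$ and $\gamma \in \Gamma_v$, $y'_{(v,\gamma)} = y''_{(v,\gamma)}$. For $v \in W$ and $\gamma \in \Gamma_v$,
we have:

$$\begin{aligned}
y'_{(v,\gamma)} &= \sum_{v' \in V \setminus F} \alpha(v') \sum_{n=0}^{\infty} (P^{\sigma})^n_{v',v}\, \sigma(v)(\gamma) \\
&= \left( \sum_{v' \in V \setminus F} \alpha(v') \sum_{n=0}^{\infty} (P^{\sigma})^n_{v',v} \right) \frac{y''_{(v,\gamma)}}{\sum_{\gamma' \in \Gamma_v} y''_{(v,\gamma')}}
\end{aligned}$$

But recall that the "expected number of times we will transition out of state $v$" is given by
$\sum_{v' \in V \setminus F} \alpha(v') \sum_{n=0}^{\infty} (P^{\sigma})^n_{v',v} = \sum_{\gamma' \in \Gamma_v} y'_{(v,\gamma')}$.
Hence $y'_{(v,\gamma)} = \left( \sum_{\gamma' \in \Gamma_v} y'_{(v,\gamma')} \right) \cdot y''_{(v,\gamma)} / \sum_{\gamma' \in \Gamma_v} y''_{(v,\gamma')}$. Thus, by using equation (3.1) and canceling,
we get $y'_{(v,\gamma)} = y''_{(v,\gamma)}$. Thus, since $y''$ is a feasible solution to the LP, we have that for any
$v \in F$:

$$\begin{aligned}
y''_v &= \sum_{v' \in V \setminus F} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y''_{(v',\gamma')} \\
&= \sum_{v' \in W} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y''_{(v',\gamma')} \quad \text{(because all flow into $F$ is from $W$)} \\
&= \sum_{v' \in W} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')}
\end{aligned}$$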

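The last equality holds because, as we showed in the proof of (1.) $\Rightarrow$ (2.), the expression
$\sum_{v' \in W} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')} = \sum_{v' \in V \setminus F} \sum_{\gamma' \in \Gamma_{v'}} P(v',\gamma',v)\, y'_{(v',\gamma')}$ is exactly the "expected
number of times that we will visit the vertex $v \in F$ for the first time", which is precisely
the probability $\Pr^{\sigma}_{\alpha}(\Diamond \{v\})$.

Thus, clearly, $\sum_{v \in F_i} y''_v = \sum_{v \in F_i} \Pr^{\sigma}_{\alpha}(\Diamond \{v\}) = \Pr^{\sigma}_{\alpha}(\Diamond F_i)$. Thus, since we have assumed
that $\sum_{v \in F_i} y''_v \ge r_i$, we have established that $\Pr^{\sigma}_{\alpha}(\Diamond F_i) \ge r_i$, for all target sets $F_i$. □

This completes the proof that (2.) $\Rightarrow$ (1.).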

(3.) $\Leftrightarrow$ (1.). Clearly (1.) $\Rightarrow$ (3.), so we need to show that (3.) $\Rightarrow$ (1.).

Let $U$ be the set of achievable vectors, i.e., all $k$-vectors $r = (r_1 \ldots r_k)$ such that there
is an (unrestricted) strategy $\sigma$ such that $\bigwedge_{i=1}^{k} \Pr^{\sigma}_{\alpha}(\Diamond F_i) \ge r_i$. Let $U^{\odot}$ be the analogous set
where the strategy $\sigma$ is restricted to be a possibly randomized but memoryless (stationary)
strategy. Clearly, $U$ and $U^{\odot}$ are both downward closed, i.e., if $r \ge r'$ and $r \in U$ then also
$r' \in U$, and similarly with $U^{\odot}$. Also, obviously $U^{\odot} \subseteq U$. We characterized $U^{\odot}$ in (1.) $\Leftrightarrow$
(2.), in terms of a multi-objective LP. Thus, $U^{\odot}$ is the projection of the feasible space of a
set of linear inequalities (a polyhedral set), namely the set of inequalities in the variables
$y$ given in Fig. 3 and the inequalities $\sum_{v \in F_i} y_v \ge r_i$, $i = 1, \ldots, k$. The feasible space is a
polyhedron in the space indexed by the $y$ variables and the $r_i$'s, and $U^{\odot}$ is its projection on
the subspace indexed by the $r_i$'s. Since the projection of a convex set is convex, it follows
that $U^{\odot}$ is convex.

Suppose that there is a point $r \in U \setminus U^{\odot}$. Since $U^{\odot}$ is convex, this implies that there is a
separating hyperplane (see, e.g., [GLS93]) that separates $r$ from $U^{\odot}$, and in fact since $U^{\odot}$ is
downward closed, there is a separating hyperplane with non-negative coefficients, i.e. there
is a non-negative "weight" vector $w = (w_1, \ldots, w_k)$ such that $w^T r = \sum_{i=1}^{k} w_i r_i > w^T x$ for
every point $x \in U^{\odot}$.

Consider now the MDP $M$ with the following undiscounted reward structure. There is
0 reward for every state, action and transition, except for transitions to a state $v \in F$ from
a state in $V \setminus F$; i.e. a reward is produced only once, in the first transition into a state of
$F$. The reward for every transition to a state $v \in F$ is $\sum \{ w_i \mid i \in \{1, \ldots, k\}\ \&\ v \in F_i \}$. By
the definition, the expected reward of a policy $\sigma$ is $\sum_{i=1}^{k} w_i \Pr^{\sigma}_{\alpha}(\Diamond F_i)$. From classical MDP
theory, we know that there is a memoryless strategy (in fact even a deterministic one) that
maximizes the expected reward for this type of reward structure. (Namely, this is a positive
bounded reward case: see, e.g., Theorem 7.2.11 in [Put94].) Therefore, $\max\{ w^T x \mid x \in U \} = \max\{ w^T x \mid x \in U^{\odot} \}$,
contradicting our assumption that $w^T r > \max\{ w^T x \mid x \in U^{\odot} \}$.

□

Corollary 3.5. Given an MDP $M = (V, \Gamma, \delta)$, a number of target sets $F_i \subseteq V$, $i = 1, \ldots, k + k'$,
such that every state $v \in F = \bigcup_{i=1}^{k+k'} F_i$ is absorbing, and an initial state $u$ (or
even initial distribution $\alpha \in \mathcal{D}(V)$):

(a.) Given an extended achievability query for reachability, $\exists \sigma B$, where

$$B \equiv \bigwedge_{i=1}^{k} \left( \Pr^{\sigma}_{u}(\Diamond F_i) \ge r_i \right) \wedge \bigwedge_{j=k+1}^{k+k'} \left( \Pr^{\sigma}_{u}(\Diamond F_j) > r_j \right),$$

we can in time polynomial in the size of the input, $|M| + |B|$, decide whether $\exists \sigma B$ is
satisfiable and if so construct a memoryless strategy that satisfies it.
(b.) For $\epsilon > 0$, we can compute an $\epsilon$-approximate Pareto curve $P(\epsilon)$ for the multi-objective
reachability problem with objectives $\Diamond F_i$, $i = 1, \ldots, k$, in time polynomial in $|M|$ and
$1/\epsilon$.

Proof. For (a.), consider the constraints of the LP in Figure 3, and add the following
constraints: for each $i \in \{1, \ldots, k\}$ add the constraint $\sum_{v \in F_i} y_v \ge r_i$, and for each
$j \in \{k + 1, \ldots, k + k'\}$, add the constraint $\sum_{v \in F_j} y_v \ge r_j + z$, where $z$ is a new variable,
and also add the constraint $z \ge 0$. Finally, consider the new objective "Maximize $z$".
Solve this LP to find whether an optimal feasible solution $y^*, z^*$ exists, and if so whether
$z^* > 0$. If no solution exists, or if $z^* \le 0$, then the extended achievability query is not
satisfiable. Otherwise, if $z^* > 0$, then a strategy that satisfies $\exists \sigma B$ exists, and moreover
we can construct a memoryless strategy that satisfies it by using the vector $y'' = y^*$ and
picking the strategy $\sigma$ constructed from $y''$ in the proof of (2.) $\Rightarrow$ (1.) in Theorem 3.2.

Part (b.) is immediate from Theorem 3.2 and the results of [PY00], which show we
can $\epsilon$-approximate the Pareto curve for multi-objective linear programs in time polynomial
in the size of the constraints and objectives and in $1/\epsilon$. □
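
For reference, the single-objective LP used in part (a.) can be written out as follows. This is only a restatement of the construction in the proof, with the flow constraints abbreviated by a reference to Figure 3.

```latex
\begin{align*}
\text{Maximize }   \quad & z \\
\text{subject to } \quad & \text{the constraints of Figure 3,} \\
& \sum_{v \in F_i} y_v \ge r_i,     && i = 1, \ldots, k, \\
& \sum_{v \in F_j} y_v \ge r_j + z, && j = k+1, \ldots, k+k', \\
& z \ge 0 .
\end{align*}
```

The query is satisfiable exactly when this LP has an optimal solution with $z^* > 0$, since a positive slack $z^*$ certifies all strict inequalities simultaneously.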

4. Qualitative multi-objective model checking 

Theorem 4.1. Given an MDP $M$, an initial state $u$, and a qualitative multi-objective query
$B$, we can decide whether there exists a strategy $\sigma$ that satisfies $B$, and if so construct such
a strategy, in time polynomial in $|M|$, and using only graph-theoretic methods (in particular,
without linear programming).

Proof. By the discussion in Section 2, it suffices to consider the case where we are given an
MDP, $M$, and two sets of $\omega$-regular properties $\Phi$, $\Psi$, and we want a strategy $\sigma$ such that

$$\bigwedge_{\varphi \in \Phi} \Pr^{\sigma}_{u}(\varphi) = 1 \;\wedge\; \bigwedge_{\psi \in \Psi} \Pr^{\sigma}_{u}(\psi) > 0$$

Assume the properties in $\Phi$, $\Psi$ are all given by (nondeterministic) Büchi automata $A_i$. We
will use and build on results in [CY98]. In [CY98] (Lemma 4.4, page 1411) it is shown that
we can construct from $M$ and from a collection $A_i$, $i = 1, \ldots, m$, of Büchi automata, a new
MDP $M'$ (a refinement of $M$) which is the "product" of $M$ with the naive determinization
of all the $A_i$'s (i.e., the result of applying the standard subset construction on each $A_i$,
without imposing any acceptance condition). Technically, we have to slightly adapt the
constructions of [CY98], which use the convention that MDP states are either purely
controlled or purely probabilistic, to the convention used in this paper which combines both
control and probabilistic behavior at each state. But these adaptations are straightforward.
For completeness, we recall the (adapted) formal definition of $M'$. The states of
the MDP $M'$ are tuples $(x, z_1, \ldots, z_m)$, where $x$ is a state of the MDP, $M$, and $z_i$ is a set
of states of $A_i$. The transition relation $\delta'$ of $M'$ is as follows. There exists a transition
$((x, z_1, \ldots, z_m), a, p, (x', z'_1, \ldots, z'_m)) \in \delta'$ if and only if the transition $(x, a, p, x')$ is in $M$ and,
for each $i = 1, \ldots, m$, $z'_i$ is precisely the set of states in the Büchi automaton $A_i$ that one
could reach with one transition, starting from some state in the set $z_i$ and reading the
symbol $l(x')$. Technically, we also have to add a dummy initial state $x_0$ to the MDP, $M$,
such that there is a single enabled action, $\gamma_0$, at $x_0$, and such that there are transitions from
$x_0$ on action $\gamma_0$ to other states according to some initial probability distribution on states,
$\alpha \in \mathcal{D}(V)$. Thus, in particular, if we assume there is just one initial state $u$ in the MDP, $M$,
then we would now have one transition $(x_0, \gamma_0, 1, u) \in \delta$ in the new $M$ with added dummy
state $x_0$. The reason for adding the dummy $x_0$ is because our definition of the product $M'$
does not use the label of the initial state in defining the transitions of $M'$. We also assume,
w.l.o.g., that each Büchi automaton $A_i$ has a single initial state $s^i_0$. In this way, the initial
state of $M'$ becomes the tuple $v_0 = (x_0, \{s^1_0\}, \ldots, \{s^m_0\})$.
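
The automaton components of a product transition are obtained by one step of the subset construction per automaton. The following sketch shows that step in isolation; the encoding of each automaton as a dictionary from (state, symbol) to successor sets and the name `product_successor` are assumptions made for illustration.

```python
def product_successor(z_sets, automata, symbol):
    """One step of the naive determinization in the product M'.

    z_sets  : tuple (z_1, ..., z_m) of frozensets of automaton states
    automata: list of transition maps, automata[i][(q, symbol)] -> set of successors
              (hypothetical encoding of the Buchi automaton A_i, acceptance ignored)
    symbol  : the label l(x') of the MDP state being entered
    Returns the tuple (z'_1, ..., z'_m).
    """
    return tuple(
        frozenset(q2 for q in z for q2 in delta.get((q, symbol), ()))
        for z, delta in zip(z_sets, automata)
    )

# Made-up automaton over symbols "a", "b": s0 -a-> {s0, s1}, s1 -b-> {s1}.
A1 = {("s0", "a"): {"s0", "s1"}, ("s1", "b"): {"s1"}}
z0 = (frozenset({"s0"}),)
z1 = product_successor(z0, [A1], "a")
z2 = product_successor(z1, [A1], "b")
```

A transition $(x, a, p, x')$ of $M$ then lifts to $((x, z), a, p, (x', z'))$ with `z' = product_successor(z, automata, l(x'))`.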

By Lemma 4.4 and 4.5 of [CY98], this MDP $M'$ has the following two properties. For
every subset $R$ of $\Phi \cup \Psi$ there is a subset $T_R$ of corresponding "target states" of $M'$ (and
we can compute this subset efficiently, in time polynomial in the size of $M'$) that satisfies
the following two conditions:

(I) If a trajectory of $M'$ hits a state in $T_R$ at some point, then we can apply from that
point on a strategy $\mu_R$ (which is deterministic but uses memory) which ensures that
the resulting infinite trajectory satisfies all properties in $R$ almost surely (i.e., with
conditional probability 1, conditioned on the initial prefix that hits $T_R$).

(II) For every strategy, the set of trajectories that satisfy all properties in $R$ and do not
infinitely often hit some state of $T_R$ has probability 0.

We now outline the algorithm for deciding qualitative multi-objective queries. 

(1) Construct the MDP $M'$ from $M$ and from the properties $\Phi$ and $\Psi$ (in other words,
using one automaton for each property in $\Phi$ and one for each property in $\Psi$).
(2) Compute $T_\Phi$, and compute for each property $\psi_i \in \Psi$ the set of states $T_{R_i}$ where
$R_i = \Phi \cup \{\psi_i\}$.^3

(3) If $\Phi \neq \emptyset$, prune $M'$ by identifying and removing all "bad" states by applying the
following rules.

(a) All states $v$ that cannot "reach" any state in $T_\Phi$ are "bad".^4

(b) If for a state $v$ there is an action $\gamma \in \Gamma_v$, such that there is a transition $(v, \gamma, p, v') \in \delta'$,
$p > 0$, and $v'$ is bad, then remove $\gamma$ from $\Gamma_v$.

(c) If for some state $v$, $\Gamma_v = \emptyset$, then mark $v$ as bad.

Keep applying these rules until no more states can be labelled bad and no more actions
removed for any state.

(4) Restrict $M'$ to the reachable states (from the initial state $v_0$) that are not bad, and
restrict their action sets to actions that have not been removed, and let $M''$ be the
resulting MDP.

(5) If ($M'' = \emptyset$ or $\exists \psi_i \in \Psi$ such that $M''$ does not contain any state of $T_{R_i}$)

then return No.
Else return Yes.
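
Steps (3) to (5) amount to a fixpoint computation on the graph of $M'$, which the following Python sketch illustrates. It assumes the sets $T_\Phi$ and $T_{R_i}$ have already been computed, it uses a hypothetical dictionary encoding of $M'$, and it reads rule (a) as being re-evaluated over the actions and states that remain after earlier pruning rounds; that reading is an assumption of this sketch.

```python
def can_reach(targets, Gamma, delta, bad):
    """Non-bad states with a positive-probability path (length >= 0) to `targets`."""
    good = {t for t in targets if t not in bad}
    changed = True
    while changed:
        changed = False
        for v in Gamma:
            if v in good or v in bad:
                continue
            if any(p > 0 and w in good
                   for a in Gamma[v] for w, p in delta[(v, a)].items()):
                good.add(v)
                changed = True
    return good

def qualitative_query(v0, Gamma, delta, T_phi, T_R, phi_empty=False):
    """Return True iff the pruned MDP M'' is non-empty and meets every set in T_R.

    Gamma[v]: set of actions at v; delta[(v, a)]: dict successor -> probability.
    """
    Gamma = {v: set(acts) for v, acts in Gamma.items()}  # working copy, pruned below
    bad = set()
    while not phi_empty:
        size = (len(bad), sum(len(acts) for acts in Gamma.values()))
        bad |= set(Gamma) - can_reach(T_phi, Gamma, delta, bad)            # rule (a)
        for v in Gamma:                                                    # rule (b)
            Gamma[v] = {a for a in Gamma[v]
                        if not any(p > 0 and w in bad
                                   for w, p in delta[(v, a)].items())}
        bad |= {v for v in Gamma if not Gamma[v]}                          # rule (c)
        if size == (len(bad), sum(len(acts) for acts in Gamma.values())):
            break                                                          # fixpoint
    # Step (4): states of M'' are the non-bad states reachable from v0.
    stack = [] if v0 in bad else [v0]
    keep = set(stack)
    while stack:
        v = stack.pop()
        for a in Gamma[v]:
            for w, p in delta[(v, a)].items():
                if p > 0 and w not in bad and w not in keep:
                    keep.add(w)
                    stack.append(w)
    # Step (5)
    return bool(keep) and all(keep & T for T in T_R)

# Made-up example: d is a trap outside T_phi = {g}; action "b" may fall into it.
delta = {("v0", "a"): {"g": 1.0}, ("v0", "b"): {"g": 0.5, "d": 0.5},
         ("g", "c"): {"g": 1.0}, ("d", "c"): {"d": 1.0}}
yes = qualitative_query("v0", {"v0": {"a", "b"}, "g": {"c"}, "d": {"c"}},
                        delta, {"g"}, [{"g"}])
no = qualitative_query("v0", {"v0": {"b"}, "g": {"c"}, "d": {"c"}},
                       delta, {"g"}, [{"g"}])
```

In the first call action "b" is pruned and "a" remains, so the answer is Yes; in the second call the only action at the initial state is pruned, the initial state becomes bad, and the answer is No.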

Correctness proof: In one direction, suppose there is a strategy $\sigma$ such that
$\bigwedge_{\varphi \in \Phi} \Pr^{\sigma}_{u}(\varphi) = 1 \wedge \bigwedge_{\psi \in \Psi} \Pr^{\sigma}_{u}(\psi) > 0$. First, note that there cannot be any finite prefix of a trajectory under
$\sigma$ that hits a state that cannot reach any state in $T_\Phi$. For, if there was such a path, then all
trajectories that start with this prefix would go only finitely often through $T_\Phi$. Hence (by
property (II) above) almost all these trajectories do not satisfy all properties in $\Phi$, which
contradicts the fact that all these properties have probability 1 under $\sigma$. From the fact that
no path under $\sigma$ hits a state that cannot reach $T_\Phi$, it follows by an easy induction that no
finite trajectory under $\sigma$ hits any bad state. That is, under $\sigma$ all trajectories stay in the
sub-MDP $M''$. Since every property $\psi_i \in \Psi$ has probability $\Pr^{\sigma}_{u}(\psi_i) > 0$ and almost all
trajectories that satisfy $\psi_i$ and $\Phi$ must hit a state of $T_{R_i}$ (property (II) above), it follows
that $M''$ contains some state of $T_{R_i}$ for each $\psi_i \in \Psi$. Thus the algorithm returns Yes.

In the other direction, suppose that the algorithm returns Yes. First, note that for all
states $v$ of $M''$, and all enabled actions $\gamma \in \Gamma_v$ in $M''$, all transitions $(v, \gamma, p, v') \in \delta'$, $p > 0$, of
$M'$ must still be in $M''$ (otherwise, $\gamma$ would have been removed from $\Gamma_v$ at some stage using
rule 3(b)). On the other hand, some states may have some missing actions in $M''$. Next,
note that all bottom strongly connected components (BSCCs) of $M''$ (to be more precise,
in the underlying one-step reachability graph of $M''$) contain a state of $T_\Phi$ (if $\Phi = \emptyset$ then
all states are in $T_\Phi$), for otherwise the states in these BSCCs would have been eliminated
at some stage using rule 3(a).

Define the following strategy $\sigma$ which works in two phases. In the first phase, the
trajectory stays within $M''$. At each control state take a random action that remains in
$M''$ out of the state; the probabilities do not matter, we can use any non-zero probability
for all the remaining actions. In addition, at each state, if the state is in $T_\Phi$ or it is in
$T_{R_i}$ for some property $\psi_i \in \Psi$, then with some nonzero probability the strategy decides to
terminate phase 1 and move to phase 2 by switching to the strategy $\mu_\Phi$ or $\mu_{R_i}$ respectively,
which it applies from that point on. (Note: a state may belong to several $T_{R_i}$'s, in which
case each one of them gets some non-zero probability; the precise value is unimportant.)

^3 Actually these sets can all be computed together: we can compute maximal closed components of the
MDP, determine the properties that each component favors (see Def. 4.1 of [CY98]), and tag each state
with the sets for which it is a target state.

^4 By "reach", we mean that starting at the state $v = v_0$, there is a sequence of transitions
$(v_i, \gamma, p_i, v_{i+1}) \in \delta'$, $p_i > 0$, such that $v_n \in T_\Phi$, for some $n \ge 0$.

We claim that this strategy $\sigma$ meets the desired requirements: it ensures probability
1 for all properties in $\Phi$ and positive probability for all properties in $\Psi$. For each $\psi_i \in \Psi$,
the MDP $M''$ contains some state of $T_{R_i}$; with nonzero probability the process will follow
a path to that state and then switch to the strategy $\mu_{R_i}$ from that point on, in which case
it will satisfy $\psi_i$ (property (I) above). Thus, all properties in $\Psi$ are satisfied with positive
probability.

As for $\Phi$ (if $\Phi \neq \emptyset$), note that with probability 1 the process will switch at some point
to phase 2, because all BSCCs of $M''$ have a state in $T_\Phi$. When it switches to phase 2 it
applies strategy $\mu_\Phi$ or $\mu_{R_i}$ for some $R_i = \Phi \cup \{\psi_i\}$, hence in either case it will satisfy all
properties of $\Phi$ with probability 1. □



5. Quantitative multi-objective model checking

Theorem 5.1.

(1.) Given an MDP $M$, an initial state $u$, and a quantitative multi-objective query $B$, we
can decide whether there exists a strategy $\sigma$ that satisfies $B$, and if so construct such a
strategy, in time polynomial in $|M|$.
(2.) Moreover, given $\omega$-regular properties $\Phi = (\varphi_1, \ldots, \varphi_k)$, we can construct an
$\epsilon$-approximate Pareto curve $P_{M_u,\Phi}(\epsilon)$, for the set of achievable probability vectors $U_{M_u,\Phi}$, in time
polynomial in $|M|$ and in $1/\epsilon$.

Proof. For (1.), by the discussion in Section 2, we only need to consider extended
achievability queries, $B \equiv \bigwedge_{i=1}^{k'} \Pr^{\sigma}_{u}(\varphi_i) \ge r_i \wedge \bigwedge_{j=k'+1}^{k} \Pr^{\sigma}_{u}(\varphi_j) > r_j$, where $k \ge k' \ge 0$, and for
a vector $r \in (0,1]^k$. Let $\Phi = (\varphi_1, \ldots, \varphi_k)$. We are going to reduce this multi-objective
problem with objectives $\Phi$ to the quantitative multi-objective reachability problem studied
in Section 3. From our reduction, both (1.) and (2.) will follow, using Corollary 3.5. As in
the proof of Theorem 4.1, we will build on constructions from [CY98]: form the MDP $M'$
consisting of the product of $M$ with the naive determinizations of the automata $A_i$ for the
properties $\varphi_i \in \Phi$. For each subset $R \subseteq \Phi$ we determine the corresponding subset $T_R$ of
target states in $M'$.^5

Construct the following MDP $M''$. Add to $M'$ a new absorbing state $s_R$ for each subset
$R$ of $\Phi$. For each state $u$ of $M'$ and each maximal subset $R$ such that $u \in T_R$ add a new
action $\gamma_R$ to $\Gamma_u$, and a new transition $(u, \gamma_R, 1, s_R)$ to $\delta$. With each property $\varphi_i \in \Phi$ we
associate the subset of states $F_i = \{ s_R \mid \varphi_i \in R \}$. Let $F = (\Diamond F_1, \ldots, \Diamond F_k)$. Let $u^*$ be the
initial state of the product MDP $M''$, given by the start state $u$ of $M$ and the start states
of all the naively determinized $A_i$'s. Recall that $U_{M_u,\Phi} \subseteq [0,1]^k$ denotes the achievable set
for the properties $\Phi$ in $M$ starting at $u$, and that $U_{M''_{u^*},F}$ denotes the achievable set for $F$
in $M''$ starting at $u^*$.
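
The bookkeeping in this construction (which actions $\gamma_R$ to add at each state, and which sinks belong to each $F_i$) can be sketched as follows. The encoding of the sets $T_R$ as a dictionary keyed by frozensets of property names, and the function names, are assumptions for illustration; only sinks $s_R$ that actually receive a transition are materialised here, which does not affect reachability probabilities.

```python
def sink_actions(T):
    """For each state u of M', the maximal property sets R with u in T_R.

    T maps a frozenset R of property names to the set T_R of states of M'
    (hypothetical encoding). Each returned R stands for one new action
    gamma_R at u with the single transition (u, gamma_R, 1, s_R).
    """
    by_state = {}
    for R, states in T.items():
        for u in states:
            by_state.setdefault(u, []).append(R)
    # keep only sets that are not strictly contained in another candidate
    return {u: [R for R in Rs if not any(R < R2 for R2 in Rs)]
            for u, Rs in by_state.items()}

def target_sets(properties, actions):
    """F_i = { s_R | phi_i in R }, with each sink s_R represented by R itself."""
    sinks = {R for Rs in actions.values() for R in Rs}
    return {phi: {R for R in sinks if phi in R} for phi in properties}

# Made-up target sets for two properties p1, p2.
T = {frozenset({"p1"}): {"u", "w"},
     frozenset({"p2"}): {"u"},
     frozenset({"p1", "p2"}): {"u"}}
acts = sink_actions(T)
F = target_sets(["p1", "p2"], acts)
```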

Lemma 5.2. $U_{M_u,\Phi} = U_{M''_{u^*},F}$. Moreover, from a strategy $\sigma$ that achieves $r$ in $U_{M_u,\Phi}$, we
can recover a strategy $\sigma'$ that achieves $r$ in $U_{M''_{u^*},F}$, and vice versa.

^5 Again, we don't need to compute these sets separately. See Footnote 3.

Proof. One direction is easy. Given such a strategy $\sigma'$ in $M''$, we follow in $M'$ (and in
$M$) the same strategy (of course, only the first component of states of $M''$ matters in
$M$), until just before it transitions to a state $s_R$, at which point it must be in $T_R$, and at
that point our strategy $\sigma$ switches to the strategy $\mu_R$. This guarantees, for every $\varphi_i \in \Phi$,
$\Pr^{\sigma}_{u}(\varphi_i) \ge \Pr^{\sigma'}_{u^*}(\Diamond F_i) \ge r_i$.

For the other direction, suppose that the claim is not true, i.e. there is a strategy $\sigma$
in $M$ which ensures probability $\Pr^{\sigma}_{u}(\varphi_i) \ge r_i$, $i = 1, \ldots, k$, but $r \notin U_{M''_{u^*},F}$. Note that all
states in $F = \bigcup_{i=1}^{k} F_i$ are absorbing. From Theorem 3.2 we know that $U_{M''_{u^*},F} = U^{\odot}_{M''_{u^*},F}$,
where $U^{\odot}_{M''_{u^*},F}$ is the set of value vectors achievable by memoryless strategies. Recall that
$U_{M''_{u^*},F} = U^{\odot}_{M''_{u^*},F}$ is convex, and that it is downward-closed. Since $r \notin U_{M''_{u^*},F}$, as in the
proof of (3.) $\Rightarrow$ (1.) in Thm. 3.2, there must be a separating hyperplane, i.e., a
non-negative weight vector $w = (w_1, \ldots, w_k)$ such that $w^T r = \sum_{i=1}^{k} w_i r_i > w^T x$ for every point
$x \in U_{M''_{u^*},F}$.

Consider $M$ with the following reward structure, denoted $rew(w)$: a trajectory $\tau$ of $M$
receives reward $\sum \{ w_i \mid \tau \text{ satisfies } \varphi_i \}$. This is not the traditional type of reward structure
where reward is obtained at the states and transitions of the trajectory; it is obtained
only at infinity when the trajectory has finished and we get a reward that depends on the
properties that were satisfied. In [CY98] optimization of the expected reward for MDPs
with this kind of reward structure was studied and solved by reducing the problem to
an MDP with a classical type of reward. We reuse that construction here. Consider the
MDP $M''$ augmented with a traditional type of reward structure, denoted $rew''$, in which
each transition of the form $(u, \gamma_R, 1, s_R)$ produces reward $\sum \{ w_i \mid \varphi_i \in R \}$ while all other
transitions (and states and actions) give 0 reward. Let $\hat{M}''$ be a sub-MDP of $M''$ that
contains for each state $u$ only one (at most) transition of the form $(u, \gamma_R, 1, s_R)$, namely
the one that produces the maximum reward (breaking ties arbitrarily). Clearly, there is no
reason ever to select from a state $u$ any transition $(u, \gamma_{R'}, 1, s_{R'})$ that produces lower reward,
thus, $M''$ and $\hat{M}''$ have the same optimal expected reward. It is shown in [CY98] that the
optimal expected rewards in $(M, rew(w))$ and $(\hat{M}'', rew'')$, and thus also in $(M'', rew'')$, are
equal to each other. Moreover, the optimum value in these MDPs is achievable, i.e., there
are optimal strategies, and in fact a deterministic finite-memory optimal strategy can be
constructed.

The optimal expected reward in $(M, rew(w))$ is at least $w^T r$ (because strategy $\sigma$
achieves $w^T r$), while the optimal expected reward in $(M'', rew'')$ is equal to
$\max\{ w^T x \mid x \in U_{M''_{u^*},F} \}$, because rewards are only obtained by transitioning to a state in $F$.
Therefore, $w^T r \le \max\{ w^T x \mid x \in U_{M''_{u^*},F} \}$, contradicting our hypothesis that
$w^T r > \max\{ w^T x \mid x \in U_{M''_{u^*},F} \}$. □

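It follows from the lemma that: there exists a strategy $\sigma$ in $M$ such that

$$\bigwedge_{i=1}^{k'} \Pr^{\sigma}_{u}(\varphi_i) \ge r_i \;\wedge\; \bigwedge_{j=k'+1}^{k} \Pr^{\sigma}_{u}(\varphi_j) > r_j$$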






K. ETESSAMI, M. KWIATKOWSKA, M. Y. VARDI, AND M. YANNAKAKIS 



if and only if there exists a strategy $\sigma'$ in $M''$ such that

$$\bigwedge_{i=1}^{k'} \Pr^{\sigma'}_{u^*}(\Diamond F_i) \ge r_i \;\wedge\; \bigwedge_{j=k'+1}^{k} \Pr^{\sigma'}_{u^*}(\Diamond F_j) > r_j .$$

Moreover, such strategies can be recovered from each other. Thus (1.) and (2.) follow,
using Corollary 3.5. □

6. Concluding remarks 

We mention that recent results by Diakonikolas and Yannakakis [DY08] provide improved
upper bounds for approximation of convex Pareto curves, and for computing a
smallest such approximate convex Pareto set. These results yield significantly improved
algorithms, particularly in the bi-objective case, for the multi-objective LP problem, and
thus also for the multi-objective MDP problems studied in this paper. In particular, in the
bi-objective MDP case, [DY08] provides a polynomial time algorithm to compute a minimal
$\epsilon$-approximate (convex) Pareto set (i.e., one with the fewest number of points possible).

We mention that, although we use LP methods to obtain our complexity upper bounds,
in practice there is a way to combine other efficient iterative methods used for solving MDPs,
e.g., based on value iteration or policy (strategy) iteration, with our results in order to
approximate the Pareto curve for multi-objective model checking. This is because the results
of [PY00, DY08] for multi-objective convex optimization problems only require a black-box
routine that optimizes (exactly or approximately) positive linear combinations of the
objectives. Specifically, in our setting the multiple MDP objectives ask to optimize the
probabilities of different linear-time $\omega$-regular properties. By using the results in [CY98],
it is possible to reduce the problem of optimizing such positive linear combinations to the
problem of finding the optimal expected reward for a new MDP with positive rewards. The
task of computing or approximating this optimal expected reward can be carried out using
any of various standard iterative methods, e.g., based on value iteration and policy iteration
(see [Put94]). These can thus be used to answer (exactly or approximately) the black-box
queries required by the methods of [PY00, DY08], thereby yielding a method for
approximating the Pareto curve (albeit without the same theoretical complexity guarantees).
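
As an illustration of such a black-box routine in the reachability setting, the sketch below runs plain value iteration to approximate the optimal expected reward when entering an absorbing target $v$ pays $\sum \{ w_i \mid v \in F_i \}$ once. The encoding, the function names, and the fixed number of sweeps are assumptions; no convergence-rate guarantee is claimed.

```python
def target_rewards(F_sets, w):
    """reward[v] = sum of w_i over all i with v in F_i (paid once, on entering v)."""
    reward = {}
    for F_i, w_i in zip(F_sets, w):
        for v in F_i:
            reward[v] = reward.get(v, 0.0) + w_i
    return reward

def value_iteration(Gamma, P, reward, sweeps=200):
    """Approximate the maximal expected total reward from each non-target state.

    Gamma[v]: actions at non-target state v; P[(v, a)]: dict successor -> probability.
    Every successor must be either a key of Gamma or a key of `reward`
    (hypothetical encoding; a non-target dead end needs a self-loop in Gamma).
    """
    val = {v: 0.0 for v in Gamma}
    for _ in range(sweeps):
        val = {v: max(sum(p * (reward[u] if u in reward else val[u])
                          for u, p in P[(v, a)].items())
                      for a in Gamma[v])
               for v in Gamma}
    return val

# Toy MDP (made up): "a" reaches t1 surely; "b" reaches t2 or returns to s.
Gamma = {"s": ["a", "b"]}
P = {("s", "a"): {"t1": 1.0}, ("s", "b"): {"t2": 0.5, "s": 0.5}}
F_sets = [{"t1"}, {"t2"}]
prefer_t2 = value_iteration(Gamma, P, target_rewards(F_sets, (1.0, 3.0)))
prefer_t1 = value_iteration(Gamma, P, target_rewards(F_sets, (2.0, 1.0)))
```

Varying the weight vector $w$ and recording the maximising strategies is how a weighted-sum oracle of this kind would be queried when approximating the Pareto curve.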

An important further application of our results is to extend the asymmetric
assume-guarantee compositional reasoning rule discussed in Section 2 to a general
compositional framework for probabilistic systems. It is indeed possible to describe symmetric
assume-guarantee rules that allow for general composition of MDPs. A full treatment of
the general compositional framework requires a separate paper, and we plan to expand on
this in follow-up work.
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